ABSTRACT

A set of 1438 human exons was subjected to nested 
PCR. The initial success rate using a standard PCR 
protocol required for ligation-independent cloning
was 85.3%. Logistic regression analysis was conducted
on 27 primer- and template-related characteristics,
of which most could be ignored apart from
those related to the GC content of the template. 
Overall GC content of the template was a good 
predictor for PCR success; however, specificity and 
sensitivity values for predicted outcome were 
improved to 84.3 and 94.8%, respectively, when 
regionalized GC content was employed. This repre- 
sented a significant improvement in predictability 
with respect to GC content alone (P < 0.001; χ²) and
is expected to increase in relative sensitivity as template
size increases. Regionalized GC was calculated
with respect to a threshold of 61% GC content
and a sliding window of 21 bp across an entire template
sequence. Fine-tuning of PCR conditions is not
practicable for all target sequences whenever a
large number of genes of different length and GC
content are to be amplified in parallel, particularly if
total open reading frame or domain coverage is 
essential for recombinant protein synthesis. Thus, 
the present method is proposed as a means of 
grouping subsets of genes possessing potentially 
difficult target sequences so that PCR conditions 
can be optimized separately in order to obtain
improved outcomes. 



INTRODUCTION 

The recent mapping of the human, mouse, fly, yeast and other 
genomes has paved the way to an era of massive intra- and 
inter-genomic comparisons. In parallel, biomedical research 
laboratories, biotechnology and pharmaceutical companies 
have developed high-throughput methods for genomic and 
proteomic applications (1). Many of these methods depend 
upon amplification of nucleic acids by PCR (2-5). 



PCR requires a DNA template and a pair of primers flanking 
the target DNA. An important parameter to be considered 
when selecting PCR primers is the ability of the primers to 
form a stable duplex exclusively with the specific site on the 
target DNA. The use of the nearest-neighbor thermodynamic 
parameters for computing DNA or RNA duplex stability has 
been shown to produce reliable predictions (6-9). These 
methods calculate the melting temperature (Tm) of the
primers, which is correlated with the GC ratio of the
primers. Typically, primers [...] to or higher than [...]. Other
considerations that influence the specificity of PCR include:
(i) avoidance of complementarity at the 3' termini of the 
primers, as this promotes the formation of primer dimer 
artifacts; and (ii) avoidance of stable self-complementary 
hairpin loops that increase primer stability (10). 

The DNA template used for PCR is often overlooked when 
compared with the effort put into primer design. The most 
commonly used parameters that relate to the DNA template 
are the PCR product size and the Tm of the product (10-12).




PCR has been [...] in vitro process [...].

Many tools exist that help to achieve a high yield of PCR 
products, such as primer design software (12,17), optimization 
kits and well-characterized protocols (18,19). However, these 
tools are often designed for a small number of reactions, or 
indeed a specific gene whereby the temperature and/or ion 
concentrations are varied to achieve maximal recovery of
desired product (18). This is not feasible when hundreds of
genes are to be amplified in parallel. 

Several recent studies have evaluated the success of primer 
extension for genotyping (2,20) and for generation of gene 
sequence tags (21). Vieux et al. (2) reported a 96% success
rate in PCR using a very strict primer selection strategy 
combined with stringent PCR conditions for analysis of single 
nucleotide polymorphisms. These applications have the 
luxury of scanning long nucleotide sequences until the optimal 
primers are found. However, amplifying a particular DNA 
sequence of interest does not usually allow a stringent primer 
selection strategy, especially if the target sequence is a few 
hundred base pairs in length or contains the whole open 
reading frame (ORF) or specific portions of it for recombinant 
protein synthesis (22-25). The latter are thought to become
increasingly important in a proteomics context. 
Here we report on the amplification of 1438 human exons
and efforts to establish a suitable predictor of PCR outcome. 

MATERIALS AND METHODS 

Selection of exons 

We randomly selected 1438 human ORFs from disease-related
genes available in publicly accessible clone libraries in late 
2001 and retrieved their DNA coding sequence from GenBank 
(http://www.ncbi.nlm.nih.gov). Coding sequences were com- 
pared with the human genome (Genbank build 25) using 
BLAST (26), and the exons were extracted and set in-frame. 
For ORFs containing multiple exons, the first was discarded to 
reduce the likelihood of a signal protein, and from the 
remaining exons the longest was chosen. 

Primer design strategy 

We selected by default the first and last 21 nucleotides of each 
target sequence as the primers and modified each primer only 
if more than four Gs or four Cs were present in the last five 
nucleotides of the 3' end, or if more than three consecutive Ts 
were present at the 3' end. In such cases, up to five nucleotides 
were removed from the 3' end, allowing a minimum primer 
length of 16 nucleotides. This study was conducted with a 
view to subsequent cloning in the Gateway™ system 
(Invitrogen). Therefore, two long adaptors, named attB1 and
attB2, had to be attached to both sides of the PCR product in a 
two-step procedure. First, an oligonucleotide of 14 bases was 
attached to the 5' end of the forward primer (AAAAAGCA- 
GGCTTG) and an oligonucleotide of 13 bases was attached to 
the 5' end of the reverse primer (AGAAAGCTGGGTA). 
Secondly, two universal primers were employed that bound to 
the adaptors from the first PCR. The forward universal primer 
GGGGACAAGTTTGTACAAAAAAGCAGGCTTG and the
reverse universal primer GGGGACCACTTTGTACAAGAAAGCTGGGTA
were used to complete the attB1 and attB2
sites. All primers were synthesized by Sigma Genosys.
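
The default primer construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' software: the function names are hypothetical, the reverse primer is assumed to be the reverse complement of the last 21 nucleotides, the forward universal primer is written with the standard attB1 sequence where the source text is unclear, and the 3' trimming rule is not modelled.

```python
# Adaptor and universal primer sequences as quoted in the text.
FWD_ADAPTOR = "AAAAAGCAGGCTTG"   # 14 bases, attached 5' of the forward primer
REV_ADAPTOR = "AGAAAGCTGGGTA"    # 13 bases, attached 5' of the reverse primer
# Assumption: standard attB1 sequence used where the source is unreadable.
FWD_UNIVERSAL = "GGGGACAAGTTTGTACAAAAAAGCAGGCTTG"
REV_UNIVERSAL = "GGGGACCACTTTGTACAAGAAAGCTGGGTA"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def first_round_primers(target, length=21):
    """Build the first-round (gene-specific) primer pair for one target.

    Assumption: the reverse primer anneals to the sense strand, so it is the
    reverse complement of the last `length` nucleotides of the target.
    The 3' trimming rule from the text is deliberately left out.
    """
    target = target.upper()
    if len(target) < 2 * length:
        raise ValueError("target too short for two non-overlapping primers")
    forward = FWD_ADAPTOR + target[:length]
    reverse = REV_ADAPTOR + reverse_complement(target[-length:])
    return forward, reverse
```

Each universal primer ends with the corresponding adaptor, which is what allows the second PCR to prime on the products of the first.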

PCR 

Genomic DNA was isolated from purified human white blood 
cells using a Genomic-tip™ 500/G from Qiagen. A two-step
PCR was performed in 96-well plates with a GeneAmp PCR
system 9700 from Applied Biosystems. The standard PCR
conditions were: 0.1 µg of template DNA, 0.05 µl of TaKaRa
Ex Taq, 1 µl of 10X Ex Taq buffer (2 mM Mg2+), 0.8 µl of
dNTP mixture (2.5 mM each) and 0.5 of each primer in a
10 µl reaction mixture. In all PCR cycles, denaturation lasted
30 s at 94°C and polymerization 2 min at 72°C. The annealing
step was for 30 s at varying temperatures, namely 58°C in the
first PCR and 45°C for five cycles followed by 65°C for 25
cycles in the second PCR. PCR products were visualized with
0.5 µg/ml ethidium bromide on a 1.2% agarose gel. Images
were taken using GeneGenius from Syngene® and analyzed
with the bundled GeneTools software. 

Logistic regression 

A stepwise backward likelihood ratio (LR) logistic regres- 
sion was performed with SPSS version 10. Entry and 
removal P-values were set to <0.05. The receiver operating 
characteristic (ROC) curve was used as a measure of model
performance. It was employed graphically to represent the
trade-off between false-positive and false-negative rates for
every possible cut-off. The false-positive rate was plotted on
the x-axis and the true-positive rate (1 - the false-negative rate)
on the y-axis. The area under the curve was of primary interest
as it measured the correlation between the category predicted 
by the test and the true category into which the case falls 
(27,28). 
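
As an illustration of how such a curve and its area can be obtained, the following Python sketch builds the ROC points from predicted probabilities and true outcomes and integrates them with the trapezoid rule. It is a generic textbook construction, not the SPSS procedure used in the study; the function names are hypothetical.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists covering every cut-off, strictest first.

    scores: predicted probabilities; labels: 1 = positive, 0 = negative.
    Cases that share a score are moved across the cut-off together.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative case")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last case sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def area_under_curve(fpr, tpr):
    """Trapezoid-rule area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, x1, y0, y1 in zip(fpr, fpr[1:], tpr, tpr[1:]))
```

An area of 0.5 corresponds to a predictor that cannot separate the two categories, and an area of 1.0 to perfect separation.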

Informatics 

Software for sequence analysis of primers and DNA template 
was written in Python (http://www.python.org), and all data 
and results were stored in a FileMaker database (http:// 
www.filemaker.com). SPSS software version 10 was used for 
data analysis and statistical modeling. The parameters 
employed for the study of primers and DNA template are 
summarized in Table 1. In all statistical tests, the primers were
labeled 1 and 2 according to their GC content. Primer 1 is the 
primer with the higher GC content of the two primers and not 
necessarily the forward primer. 

Regionalized GC content within template DNA was 
calculated using a sliding window of 21 nucleotides, shifted 
one nucleotide at a time. The results were plotted and the area 
under the GC curve (AUC_GC) above a 61% threshold was
calculated using the trapezoid method (Fig. 1). A high GC
content region was considered significant if it was >61% for
≥10 consecutive windows. Similarly, regionalized Tm and the
area under the Tm curve (AUC_Tm) above a threshold of 74°C
were calculated. The thresholds for both the GC curve and the
Tm curve were chosen initially as 65% and 75°C so as to reflect
population extremes. Subsequently, these threshold values
were made more precise with respect to their ability to
discriminate between 'good' and 'failed' groups for all integer
values between 50 and 70% and between 65 and 85°C,
respectively, while employing the LR logistic regression.
Table 1 summarizes the methods and parameters employed for
statistical analysis, while associated software is available
from: http://wwwcrac.pharm.uu.nl/moret/pub/benita.
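
The regionalized GC calculation can be sketched as follows. This is a minimal reading of the description above, not the authors' released software. Several details are assumptions: only runs of at least 10 consecutive windows above the threshold contribute to the area, windows are spaced one unit apart for the trapezoid rule, and the ratio is the number of contributing windows divided by the template length (following Table 1). Absolute values produced by this sketch may therefore differ in scale from those reported in the study.

```python
def window_gc(seq, window=21):
    """GC percentage of every window of `window` nt, shifted 1 nt at a time."""
    seq = seq.upper()
    if len(seq) < window:
        return []
    count = sum(base in "GC" for base in seq[:window])
    values = [100.0 * count / window]
    for i in range(window, len(seq)):
        # Slide by one: add the entering base, drop the leaving base.
        count += (seq[i] in "GC") - (seq[i - window] in "GC")
        values.append(100.0 * count / window)
    return values


def significant_mask(values, threshold=61.0, min_run=10):
    """Mark windows belonging to runs of >= min_run windows above threshold."""
    mask = [False] * len(values)
    start = None
    # A sentinel below any threshold closes a run that reaches the last window.
    for i, value in enumerate(values + [float("-inf")]):
        if value > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                for j in range(start, i):
                    mask[j] = True
            start = None
    return mask


def regional_gc_summary(seq, window=21, threshold=61.0, min_run=10):
    """Return AUC_GC, Ratio_GC and NormAUC_GC for one template (assumed form)."""
    values = window_gc(seq, window)
    mask = significant_mask(values, threshold, min_run)
    excess = [v - threshold if m else 0.0 for v, m in zip(values, mask)]
    auc = sum((a + b) / 2.0 for a, b in zip(excess, excess[1:]))
    ratio = sum(mask) / len(seq) if seq else 0.0
    return {"auc": auc, "ratio": ratio, "norm_auc": ratio * auc}
```

A template without any sustained GC-rich stretch yields a value of zero, whereas long GC-rich stretches raise both the area and the ratio, so their product grows quickly.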



RESULTS 

Primers and template properties 

Out of 2876 primers, 2501 were not altered, while 1-4 
nucleotides were removed from the 3' end of the remaining 
375. Properties employed for primer evaluation are shown in
Table 2. A wide range of values was allowed for each primer
property, yet average values of Tm, GC content and internal
stability were well within the recommended range, as defined
by Rychlik (10,29) and McPherson and Møller (18). However,
when combined parameters are examined, for example, Tm
and 3' end internal stability together, results are less clear-cut.
In the latter example, only 60% of the primers were within the
recommended range and only 37% of the primer pairs were
both within the range. Primers had 4.8 significant hits, on 
average, when compared with the human genome using 
BLAST. A search of the entire human genome for potential 
PCR products that required the primers to be on opposite 
strands and not more than 2000 nucleotides apart predicted 
that only one PCR product could be formed by in silico 
Table 1. Description of parameters used for analyzing PCR primers and template

Parameter name | Description | Reference
Tm_primer | Melting temperature of the primers | (7,10)
IntStab3 | Internal stability of the primer at the 3' end | (7,10)
IntStab5 | Internal stability of the primer at the 5' end | (7,10)
IntStabD | IntStab5 - IntStab3 |
GC_primer | Primer (G + C)/primer length |
GC_DIFF | GC_primer/GC_template |
SigHits | Number of significant hits when comparing the primer sequence with the human genome using BLAST. A BLAST hit was considered significant when 10 identical nucleotides occurred at the 3' end |
Bend | Bending value at the 3' end of the primer | EMBOSS, banana (31)
Curve | Curvature value at the 3' end of the primer | EMBOSS, banana (31)
SelfAny | The maximum local alignment score when testing a primer for annealing with itself or with the other primer. Computed by Primer3 | (12)
SelfEnd | The maximum 3'-anchored global alignment score when testing a primer for annealing with itself or with the other primer. Computed by Primer3 | (12)
Tm_product | Melting temperature of the PCR template | (32)
GC_template | Template (G + C)/template length |
TmDiff | |Tm_primer1 - Tm_primer2| |
TaOpt | Optimal temperature for PCR | (17)
Dimer | Highest possible duplex stability between both primers, calculation based on free energy values | (7)
MaxCurve | Highest DNA curvature in the PCR template | EMBOSS, banana (31)
AUC_Tm | Area under the Tm curve and above 75°C of the PCR template |
AUC_GC | Area under the GC curve and above 65% of the PCR template |
Ratio_GC | Number of GC windows with values above 65% divided by the length of the PCR template |
Ratio_Tm | Number of Tm windows with values above 75°C divided by the length of the PCR template |
NormAUC_GC | Ratio_GC × AUC_GC |
NormAUC_Tm | Ratio_Tm × AUC_Tm |
MinDist | Shortest distance from either end of the PCR template to the first high GC region |
MIN_GC | Low value of the GC content of the first and last 60 nucleotides of the PCR template |
MAX_GC | High value of the GC content of the first and last 60 nucleotides of the PCR template |
SIZE | PCR product length |




Figure 1. A sliding window of GC content plotted against the DNA sequence of an example template. The black regions represent areas above a 61% threshold
for GC content and measured across 31 bp, i.e. ≥10 sliding windows of 21 bp.



predictions. In silico, all pairs of primers generated a single 
target PCR product. A combined analysis of both primers and 
the PCR template was performed to evaluate the success of the 



reaction as shown in Table 3. The observed range was very 
wide for most of the parameters. Analysis of DNA curvature 
was included to identify DNA structural oddities that might 
Table 2. Properties of the 2876 primers employed

 | Tm (°C) | GC content (%) | GCDiff | IntStab3 (kcal/mol) | IntStab5 (kcal/mol) | IntStabD (kcal/mol) | SelfAny (kcal/mol) | SelfEnd (kcal/mol)
Average | 60.6 ± 7.12 | 49.0 ± 11.3 | 0.95 ± 0.18 | 7.6 ± 1.2 | 7.83 ± 1.35 | 0.23 ± 1.73 | 5.1 ± 1.9 | 2.6 ± 2.2
Range | 36.1-86.8 | 14.0-86.0 | 0.34-1.73 | 5-13.1 | 4.8-13 | -5.17-6.3 | 0-14 | 0-12
Recommended | 55-70 | 30-70 | >1 | <9 | NA | >0 | <8 | <3

Table 3. Template and primer properties

 | TaOpt (°C) | TmDiff (°C) | Lowest GCDiff | ΔG dimer (kcal/mol) | Tm product (°C) | GC content (%) | DNA max curvature (°)
Average | 56.5 ± 4.8 | 7.3 ± 5.7 | 0.8 ± 0.2 | 4.5 ± 7.0 | 78.2 ± 3.9 | 51 ± 9 | 32.4
Range | 32.8-69.6 | 0.01-32.8 | 0-1.4 | 0-30.5 | 68.2-89.6 | 31.2-77.1 | 9.7-120.4
Recommended | 58 | ≤5 | >1 | ≤12 | None | 30-70 | None



have affected the ability of Taq polymerase to duplicate the
template. 

PCR 

A PCR product with an expected size >350 bp was considered
'good' if the observed band was no more than 10% longer or
shorter than the expected size. A maximal deviation of 15% was allowed for
smaller products, due to the inherent insensitivity of on-gel
mobility measurements. All bands <120 bp were discarded
and interpreted as representing primer dimers or PCR artifacts.
A band of the expected size was observed for 1226 (85.3%)
sequences. The other 212 (14.7%) failed in duplicate experiments
to produce the correct band size; of these, 69 (32.5%) had no
product at all and 44 (20.7%) were associated with a product
of incorrect size. Of all the 'good' products, 858 (70%) had
one clear band and the other 305 (25%), 54 (4.4%) and eight
(0.75%) had two, three and four bands, respectively.
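
The band-scoring rule above can be written as a short Python function. This is a sketch under assumptions rather than the procedure actually applied to the gel images: the function name is hypothetical, tolerance boundaries are treated as inclusive, and a lane is scored 'good' as soon as any retained band falls within tolerance, since products with several bands were still counted as 'good' in the text.

```python
def classify_bands(expected_bp, observed_bands_bp):
    """Score one PCR lane as 'good' or 'failed' from its observed band sizes.

    Assumptions: 10% tolerance for expected sizes above 350 bp, 15% otherwise;
    bands shorter than 120 bp are discarded as primer dimers or artifacts.
    """
    tolerance = 0.10 if expected_bp > 350 else 0.15
    retained = [band for band in observed_bands_bp if band >= 120]
    for band in retained:
        if abs(band - expected_bp) <= tolerance * expected_bp:
            return "good"
    return "failed"
```

For example, an expected product of 500 bp with an observed band at 540 bp would be scored 'good', while a single band at 560 bp would be scored 'failed'.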

Numerical analysis 

Two data sets were created for the analysis: data set A 
contained 212 sequences that failed to PCR twice and data set 
B contained 318 sequences, which twice produced a clear 
visible band. We avoided including too many samples in data 
set B since it could bias the statistical analysis. Seventy 
percent of the data in each set was used for statistical analysis 
as selected by a random function, while the remainder was 
used as a test set for the prediction model. The mean values of 
groups A and B for the parameters described in Table 1 were 
compared using a one-way ANOVA. The ANOVA results 
showed that Tm and GC content are the most significant
parameters at the primer level, and parameters that are
correlated with total GC content of the template are the most
significant at the template level (Fig. 2). All the primer and
template parameters were used in a stepwise backward LR
logistic regression. Although total GC content, GC ratio,
Tm_product and TaOpt are the most significant parameters in the
ANOVA test, NormAUC_GC and NormAUC_Tm were shown to
be much better predictors in the logistic regression. A logistic



regression for each variable was performed separately, and the 
goodness of fit was assessed by -2 log likelihood. The 
strongest single predictor of success and failure of PCR using 
the logistic regression model was NormAUC_GC (Fig. 3). The
best logistic regression model contained both the primer with
lower GC content (GC_primer2) and the NormAUC_GC. Wald
values for GC_primer2, NormAUC_GC and the constant were 19.2,
40.1 and 38.2, respectively, each with similar degrees of
freedom. Thus, a good level of confidence was obtained for the 
expected PCR success, as shown by the following equation: 

$$P = \frac{e^{4.9 - 7.6 \times \mathrm{GC_{primer2}} - 0.004 \times \mathrm{NormAUC_{GC}}}}{1 + e^{4.9 - 7.6 \times \mathrm{GC_{primer2}} - 0.004 \times \mathrm{NormAUC_{GC}}}}$$



where P is the probability of a successful PCR and GC_primer2
and NormAUC_GC are the parameters described in Table 1. The
area under the ROC curve was 0.87, and the Nagelkerke R²
was 0.58. Both are high and suggest the model's performance
is good. A PCR is predicted successful for P ≥ 0.5. Using this
equation on our test set (n = 160, i.e. 30% of data set A + B), 
86.3% of PCRs were predicted correctly. The sensitivity of the 
model is the probability of correctly predicting a positive 
example, while the specificity is the probability that a positive 
example is correct (30). The specificity and sensitivity values 
of the test set were 84.3 and 94.8%, respectively; and 94.9 and 
85.2%, respectively for the 1438 PCRs examined in this study. 
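
The fitted equation is straightforward to evaluate. The Python sketch below uses the coefficients quoted above. The function names are hypothetical, and GC_primer2 is assumed to be a fraction between 0 and 1, as implied by its definition in Table 1 as (G + C)/primer length.

```python
import math


def pcr_success_probability(gc_primer2, norm_auc_gc):
    """Probability of a successful PCR from the fitted logistic equation.

    gc_primer2: GC fraction (0-1) of the primer with the lower GC content
    (assumed to be a fraction, not a percentage).
    norm_auc_gc: NormAUC_GC of the template.
    """
    z = 4.9 - 7.6 * gc_primer2 - 0.004 * norm_auc_gc
    # e^z / (1 + e^z) rewritten in the equivalent form 1 / (1 + e^-z).
    return 1.0 / (1.0 + math.exp(-z))


def predict_success(gc_primer2, norm_auc_gc, cutoff=0.5):
    """Apply the decision rule from the text: success is predicted for P >= 0.5."""
    return pcr_success_probability(gc_primer2, norm_auc_gc) >= cutoff
```

With a GC_primer2 of 0.44 and a NormAUC_GC of 340, the exponent is 0.196 and P is roughly 0.55, so the reaction sits just above the decision boundary. Raising NormAUC_GC to 1000 at the same primer GC content lowers P well below 0.5.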

The logistic regression equation was able to predict that, for 
a given value of NormAUC_GC, a reduction of the GC content
of the primer should increase the probability of PCR success.
This was due to the significantly lower GC_primer2 of group B
when compared with that of group A, at NormAUC_GC ≤ 340.
The mean values of GC_primer2 for NormAUC_GC ≤ 340 for
groups A and B were 48 ± 10% and 44 ± 9% (P < 0.05),
respectively, corresponding to a 2°C difference in mean
Tm_primer2 values. Nevertheless, for NormAUC_GC > 340, the
probability of PCR success was significantly reduced even for
low GC_primer2 values, since 67.2% of the sequences in group A
Figure 2. F-values for a one-way ANOVA computed from a learning set of 370 DNA target templates. Large F-values represent a significant difference
between the failed and successful PCR. P = 0.05 for F = 3.87 and P = 0.001 for F = 11.



Figure 3. -2 log likelihood values obtained by logistic regression for each variable separately. These values are commonly used to indicate the goodness of
fit. The lower the value, the better the model.



possessed a NormAUC_GC > 340 compared with only 5.2% for
group B. Therefore, PCR success was also predicted based on
NormAUC_GC alone with an upper threshold of 340. The
complete set of 1438 sequences was divided into three groups:
sequences with NormAUC_GC ≤ 340; NormAUC_GC between
340 and 750 (95th percentile of successful PCR); and
NormAUC_GC > 750. In the first category (n = 1143), the
success rate was 93.8%, while in the second (n = 139) and
third (n = 156) it was 71.2 and 35.2%, respectively. Thus, the
high predictive value of the index, NormAUC_GC ≤ 340, was
clearly demonstrated.
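
The three-way grouping and the reported group statistics can be checked with a few lines of Python. The function and constant names are hypothetical, and the treatment of the boundary values (340 in the first group, 750 in the second) is an assumption based on the inequalities given in the text.

```python
def norm_auc_group(norm_auc_gc, low=340.0, high=750.0):
    """Assign a template to group 1, 2 or 3 by its NormAUC_GC value."""
    if norm_auc_gc <= low:
        return 1
    if norm_auc_gc <= high:
        return 2
    return 3


# Group sizes and success rates as reported in the text.
REPORTED_GROUPS = {1: (1143, 0.938), 2: (139, 0.712), 3: (156, 0.352)}


def expected_successes(groups=REPORTED_GROUPS):
    """Number of successful PCRs implied by the per-group success rates."""
    return sum(n * rate for n, rate in groups.values())
```

The group sizes add up to 1438, and the per-group rates imply about 1226 successful reactions, which agrees with the overall number of 'good' products reported above.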

DISCUSSION 

For more than a decade, primer design has evolved into an 
efficient and mature science. Although the primer sequence 
can be modified by biologists, the target DNA cannot. 
Therefore, little attention is usually paid to the analysis of 
the PCR template prior to experimental procedures, i.e. apart
from its relevance to primer design. Faced with the challenge
of large-scale PCR, protocol optimization becomes increasingly
important, yet is increasingly problematic due to the
variation in template sequence and length. As a result, overall
PCR success rates can be compromised. In this study, we
found regionalized GC content to be a good predictor of PCR
success across multiple templates. Indeed, any parameter able
to be correlated with the GC content of the PCR template, such
as TaOpt and Tm_product, was statistically significant when PCR
success and failure were compared. However, NormAUC_GC
was seen to be a better predictor for PCR success (P < 0.001;
χ²) for the total data set of 1438 PCRs. NormAUC_GC was
much more sensitive to fluctuations in GC content than other
methods that simply relied on averaging overall GC. The
performance of this predictor is expected to improve as 
template size increases due to the greater likelihood of 
problematic regions occurring within a given template. 

The evidence presented here would suggest that the primer 
was most often not the cause of PCR failure, but rather the 
template, i.e. because all primers met similar stringency 
demands. In all cases, the average values of 63 and 52% for
GC_primer1 and GC_primer2 of failed PCRs were acceptable, even
if the most stringent primer design criteria were employed (2).
Furthermore, Rychlik et al. (11) showed that primer design
was significant for a low number of PCR cycles, while this
diminished after 25 cycles. When employing nested primer
PCR of 30 cycles for each reaction, as here for ligation-independent
cloning experiments, strong amplification can be
expected to depend less upon stringent primer design due to 
the addition of 14 and 13 bases to the 5' ends of the specific 
forward and reverse primers, respectively, and the associated 
increase in affinity of primers for template. Thus, provided 
obvious homologies and self-annealing attributes of primers 
are minimized, then most 20mer strings will be associated 
with sufficient target specificity. As a consequence, our results 
would suggest that more effort should be put towards analysis 
of the PCR template. Stringent primer design might result in 
high amounts of very pure PCR product, but it comes at the 
expense of sequence coverage. The latter is most important 
when entire ORFs or domains are being targeted and template 
coverage is essential for recombinant protein synthesis. 

For templates possessing a NormAUC_GC > 340, it is
predicted that the success of PCR will be more dependent on a
suitable protocol than on primer selection.
Therefore, when faced with the task of large-scale PCR, we
recommend dividing the samples into three groups and
subsequently optimizing the PCR for successful outcomes in
each of the following categories: sequences with
NormAUC_GC ≤ 340; NormAUC_GC between 340 and 750;
and NormAUC_GC > 750. The first group is likely to be
successfully amplified using standard PCR protocols, and as a
result primer stringency may be relaxed without deleterious 
effects, thereby allowing maximal target sequence coverage. 
The second and third groups should each be optimized in turn, 
with increasing attention being given to protocol and primer 
design. 

In summary, the NormAUC_GC of a PCR template was
found to represent a more sensitive predictor of PCR outcome
than parameters previously described, while its predictive
value as an improvement on GC content alone is likely to
increase concomitantly with template size. Although the
learning set examined during this study was derived from
nested primer PCR, the index, NormAUC_GC, is expected to
maintain its relevance for standard PCR experiments, as most
primer-related failures probably occur during initial cycles
only.